Welcome to lab 4, where we will explore the field of Spatial Transcriptomics. Many of the ideas that you encountered when working with single cell data will be revisited, but now with the advantage that we can visualize our results in real physical space (not just UMAP or tSNE space).
The field of spatial transcriptomics has grown rapidly, and multiple techniques to obtain spatial gene expression information exist. Some examples are: MERFISH, ISS, osmFISH, seqFISH, HDST, baristaSeq, Slide-seq, and GeoMx. We will, however, be focusing on spatial data generated with the Visium platform (sold by 10x Genomics). Visium is the successor to the technique - somewhat confusingly - named Spatial Transcriptomics (commonly referred to as ST).
ST was developed at SciLifeLab and presented to the world in 2016, when the publication Visualization and analysis of gene expression in tissue sections by spatial transcriptomics appeared in the journal Science. In December 2018, 10x Genomics acquired the IP rights to the ST technique, and in late 2019 they launched the Visium platform.
Both ST and Visium utilize a solid array onto which oligonucleotides with spatial barcodes have been printed at locations (spots) arranged in a regular grid. These oligonucleotides all have a poly-T sequence, making them apt to capture mRNAs by their poly-A tail. During reverse transcription, the barcodes of the oligonucleotides become embedded in the cDNA synthesized from the captured mRNAs; hence, we know at which spot each transcript was captured. Once the cDNA molecules are sequenced, we can map them back to their spatial position - using the barcodes - and thereby obtain spatial gene expression information. The “old” ST arrays had 1000 spots printed on the array, while the newer Visium slides have 5000 spots.
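To make the barcode-to-position logic concrete, here is a tiny toy sketch in R (the barcodes and coordinates are made up, and real Visium barcodes are longer):

```r
# toy lookup from spatial barcode to printed spot position (made-up values)
barcode.to.spot <- c("AAACGT" = "545.377x1766.777",
                     "TTGCAA" = "568.853x1766.777")
# a sequenced cDNA molecule carries the barcode of the oligo that captured it,
# so its spot of origin can be looked up directly
read.barcode <- "AAACGT"
barcode.to.spot[read.barcode]
```

In the real pipeline this mapping is of course handled by the 10x software, but the principle is exactly this: barcode in, spot coordinate out.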
One key fact that should be emphasized is that neither of these methods (ST and Visium) operates at the single cell level yet. The gene expression data associated with each spot is really a mixture of contributions from multiple cells, not all necessarily of the same cell type.
We begin by configuring our knitr settings; in short, we set the default values for our code chunks - which will affect the behavior and appearance of the knitted file. Do not change this code chunk.
When handing in your report, set the variable GRADE_MODE to TRUE.
GRADE_MODE <- TRUE
Next, let us load some of the libraries that we will use in this lab. The code chunk has been written so as to first check whether you have the libraries installed, and if not, install them before loading. You do not have to pay attention to what’s going on in this chunk unless you want to.
In this lab, we will be examining one tissue section from the mouse brain (coronal). 10x - the company behind Visium - kindly provides some “example” datasets on their webpage, which is where this sample is taken from. As you might remember, the expression data is only one component of the ST and Visium assays; we also have an image of the very same tissue section (stained with Hematoxylin and Eosin) to use as a reference or aid in our analysis. The HE-image for our data is shown below.
HE-image of the Mouse Brain Section
Now that we know what our tissue looks like, let us continue with the expression data. When downloaded from 10x’s website, the data is stored in either a .mtx or .h5 format, but to make things a bit easier for you these have been cast into the good old .tsv format, which you should be familiar with by now. Just as in the single cell lab, the data is represented by a count matrix with genes along one dimension and (important) spots along the other dimension. Run the next code chunk to load the aforementioned count matrix, and then inspect it.
# set path to data
DATA.PATH <- "data/mouse-brain.tsv.gz"
# read data
raw.data <- read.table(DATA.PATH,
header = T,
sep = '\t',
row.names = 1)
# inspect data
raw.data[1:5,1:5]
## Xkr4 Gm1992 Gm37381 Rp1 Sox17
## 545.377x1766.777 0 0 0 0 0
## 568.853x1766.777 0 0 0 0 0
## 580.59x1787.021 0 0 0 0 0
## 475.291x1725.951 0 0 0 0 1
## 533.81x1746.364 0 0 0 0 0
So what do we have here? Well, obviously the columns represent genes and thus the rows must represent our spots. Now, if you have a look at the rownames, you will see how there are a lot of numbers in there. The rownames actually follow a given pattern, being: [x-coordinate]x[y-coordinate]. For example, for the first spot the x-coordinate is 545.377 and the y-coordinate is 1766.777.
The next thing we want to do is to extract each spot’s coordinates and store them in a new data frame (spatial.data). For this we will use the functions strsplit and sapply; if in doubt about what any of these functions actually does, use the ?function_name feature in R!
# Extract x-coordinates
xcrd <- as.numeric(sapply(rownames(raw.data), function(x){strsplit(x,"x")[[1]][1]}))
# Extract y-coordinates
ycrd <- as.numeric(sapply(rownames(raw.data), function(x){strsplit(x,"x")[[1]][2]}))
# Create data frame
spatial.data <- data.frame(x = xcrd,
y = ycrd)
# set rownames
rownames(spatial.data) <- rownames(raw.data)
#inspect the data frame
head(spatial.data)
## x y
## 545.377x1766.777 545.377 1766.777
## 568.853x1766.777 568.853 1766.777
## 580.59x1787.021 580.590 1787.021
## 475.291x1725.951 475.291 1725.951
## 533.81x1746.364 533.810 1746.364
## 662.584x1725.951 662.584 1725.951
We can easily check that our strategy to obtain the spatial coordinates for each spot worked. The x and y coordinates (columns) should correspond to their respective rownames - which they seem to do.
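One way to make this check programmatic is to rebuild the spot names from the extracted coordinates and compare them to the originals; a minimal sketch, shown here on a couple of example spot names:

```r
# example spot names following the [x-coordinate]x[y-coordinate] pattern
spot.names <- c("545.377x1766.777", "568.853x1766.777")
xcrd <- as.numeric(sapply(spot.names, function(x){strsplit(x, "x")[[1]][1]}))
ycrd <- as.numeric(sapply(spot.names, function(x){strsplit(x, "x")[[1]][2]}))
# pasting the coordinates back together should recover the original names
all(paste0(xcrd, "x", ycrd) == spot.names)
## [1] TRUE
```

The same comparison could be run on all rownames of spatial.data.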
As a second sanity check, we may also plot the spots in 2D space and see whether their arrangement looks like something we might expect (= agreeing with the tissue structure). To do this, simply run the code below:
g <- ggplot(data = spatial.data) +
geom_point(mapping = aes(x = x, y = y),
size = 0.8
) +
theme_void() +
theme(plot.margin = unit(c(0.05, 0.05, 0.05, 0.05), "npc"))+
coord_fixed()
print(g)
If you scroll back to the HE-image above, you can clearly see how the spots resemble the tissue outline. This is positive; at least we now know that our data isn’t a complete mess, i.e., so far so good. With these affirmations, we may proceed to prepare the data (and ourselves) for some fun analysis. Similar to what we did in the single cell lab, we will use the Seurat suite; hence the next thing for us is to create a Seurat object.
Q0: Create a Seurat object named se using the CreateSeuratObject function with the following parameters:
counts = t(raw.data)
meta.data = spatial.data
project = "MouseBrain"
assay = "spRNA"
Comment1 : We do t(raw.data) since the count data we loaded was stored with samples as rows and genes as columns, which is the transpose (t) of the Seurat convention.
Comment2 : Note how we set spatial.data as meta data in our Seurat object. That gives us easy access to the coordinates.
Comment3 : The assay name “spRNA” is short for “spatial RNA”
# Create the Seurat object se here
#se <- NA
se <- CreateSeuratObject(counts = t(raw.data),
meta.data = spatial.data,
project = "MouseBrain",
assay = "spRNA"
)
se
## An object of class Seurat
## 31017 features across 2698 samples within 1 assay
## Active assay: spRNA (31017 features, 0 variable features)
We can now remove the raw.data object, since we’ve transferred all of its information to the Seurat object. This will free up some memory on our computers.
remove(raw.data)
Q1: Now that the Seurat object is formed, please answer the questions below:
Give your answers below
# Give the answers to the questions above
# by assigning the values (answers) to
# the variables ans1,ans2 and ans3
ans1 <- dim(se)[2]
ans2 <- dim(se)[1]
ans3 <- rownames(se)[1306]
cat(sprintf("No. Spots : %d \nNo genes: %d\n1306th gene : %s",ans1,ans2,ans3))
## No. Spots : 2698
## No genes: 31017
## 1306th gene : Gm37273
Comment : It might be a good idea to save these values (number of spots and genes) for later, either write them down or store them in specific variables.
So far we have plotted the spots and then compared them to the tissue image, but scrolling back and forth is tedious and cumbersome; we can do better than that! Thanks to the setup of the Visium platform, the spots can (seamlessly) be overlaid on the tissue.
The code below will do exactly this - overlay the spots on the tissue - for you. You do not have to pay too much attention to the code, but it has been annotated to inform you of what is going on in case you are interested.
# load image
img <- readPNG("data/image.png")
# make a raster object from the image
# this is needed to plot the image together with
# the data
img.grob <- rasterGrob(img,
width = unit(1, "npc"),
height = unit(1, "npc"),
interpolate = TRUE)
# create a ggplot object
g <- ggplot(data = se@meta.data) +
# add the image as a background
annotation_custom(img.grob,
xmin = -Inf,
xmax = Inf,
ymin = -Inf,
ymax = Inf) +
# plot the spots
geom_point(mapping = aes(x = x, y = y),
size = 0.7,
color = "red") +
# adjust x-axis scaling
scale_x_continuous(limits = c(0, dim(img)[2]),
expand = c(0, 0)) +
# adjust y-axis scaling
scale_y_continuous(limits = c(0, dim(img)[1]),
expand = c(0, 0)) +
# remove unnecessary stuff (ticks and axes) from the plot
theme_void() +
# configure margin settings
theme(plot.margin = unit(c(0.05, 0.05, 0.05, 0.05), "npc"))+
# fix plot aspect ratio
coord_fixed()
print(g)
Now look at that, what a beauty! The spots and tissue overlap perfectly. Now that we know how to overlay our spots on our image, we can start playing around a bit more with our data!
The whole purpose of Visium is to associate gene expression with spatial locations, i.e., each spot has an expression value for every gene. To see this property in action, we will visualize the expression of a gene called Wfs1 (a marker gene for pyramidal neurons). In short, this will be done in three steps, namely:
Create a vector color.vector which holds the expression values of Wfs1 at each spot
Create a raster grob from the HE-image to use as the plot background
Plot the spots on top of the image, colored by the values in color.vector
# create vector holding expression values for Wfs1
color.vector <- as.numeric(se@assays$spRNA["Wfs1",])
# create image grob
img.grob <- rasterGrob(img,
width = unit(1, "npc"),
height = unit(1, "npc"),
interpolate = TRUE)
# create ggplot object
g <- ggplot(data = se@meta.data) +
# add image as background
annotation_custom(img.grob,
xmin = -Inf,
xmax = Inf,
ymin = -Inf,
ymax = Inf) +
# add spots, color by expression
geom_point(mapping = aes(x = x, y = y,
color = color.vector),
pch = 20,
alpha = 0.3
) +
# choose color gradient for spots
scale_color_gradient(low = "white",
high = "red"
) +
# set labels of plot
labs(colour = "Expression Level",
title = "Wfs1 expression"
) +
# adjust x-axis scaling
scale_x_continuous(limits = c(0, dim(img)[2]),
expand = c(0, 0)) +
# adjust y-axis scaling
scale_y_continuous(limits = c(0, dim(img)[1]),
expand = c(0, 0)) +
# get rid of unnecessary stuff
theme_void() +
# add margins
theme(plot.margin = unit(c(0.05, 0.05, 0.05, 0.05), "npc"))+
# fix plot ratio
coord_fixed()
print(g)
What you can see here is how the expression of Wfs1 is mainly confined to a few distinct spatial regions. Perhaps the most prominent region is the upper one (with a slight bend); these signals overlap very well with the CA1 (Cornu Ammonis 1) region of the mouse brain. This observation tells us that maybe there are some pyramidal cells in CA1. Actually, that’s a very well established fact; from Wikipedia:
The CA areas are all filled with densely packed Pyramidal cells similar to those found in the neocortex.
so this isn’t exactly breaking news perhaps, but it is an example of the sort of inferences one can make from the raw data itself.
Before we proceed, note how we have already used the “plot-spots-on-image” procedure twice, and we will probably do it again. There is also quite a lot of code involved every time we do this, and we only make small changes to it (like adjusting the color of our spots). To avoid loads of repetitive code and make things easier for ourselves, we will therefore define a function spatial.plot that will take care of the plotting for us. This function takes parameters for the Seurat object (se), the background image (img), point size and alpha, a color vector, plot and legend titles, and the color type (continuous or categorical).
Again, you don’t have to spend tons of time on the code here, but give it a look and see if you sort of understand the workflow.
spatial.plot <- function(se,
img = NULL,
alpha = 1,
size = 1,
color.vector = NULL,
plot.title = "",
legend.title = "",
color.type = "continuous"
) {
if (is.null(color.vector)) {
gp <- geom_point(mapping = aes(x = x,
y = y
),
color = "black",
pch = 20,
size = size,
alpha = alpha
)
} else {
gp <- geom_point(mapping = aes(x = x,
y = y,
color = color.vector
),
pch = 20,
size = size,
alpha = alpha
)
}
g <- ggplot(data = se@meta.data )
if (!(is.null(img))) {
img.grob <- rasterGrob(img,
width = unit(1, "npc"),
height = unit(1, "npc"),
interpolate = TRUE)
g <- g + annotation_custom(img.grob,
xmin = -Inf,
xmax = Inf,
ymin = -Inf,
ymax = Inf) +
scale_x_continuous(limits = c(0, dim(img)[2]),
expand = c(0, 0)) +
scale_y_continuous(limits = c(0, dim(img)[1]),
expand = c(0, 0))
}
g <- g + gp
if (!(is.null(color.vector))) {
if (color.type == "categorical") {
n_col = length(unique(color.vector))
g <- g+ scale_colour_manual(values = rainbow(n_col))
} else {
g <- g+ scale_color_gradient(low = ifelse(is.null(img),
"yellow",
"white"),
high = "red"
)
}
}
g <- g+ labs(colour = legend.title,
title = plot.title
) +
theme_void() +
theme(plot.margin = unit(c(0.05, 0.05, 0.05, 0.05), "npc"))+
coord_fixed()
return(g)
}
Of course, we now want to test our new function. We use spatial.plot to create the same plot as before, using the color.vector (expression of Wfs1) to color our spots.
gg <- spatial.plot(se,
img = img,
color.vector = color.vector
)
print(gg)
As you can see, this is a bit more convenient than typing ~50 lines of code each time we want to plot something.
Q2: Armed with the spatial.plot function, it’s time for another question. One of the genes:
has a very localized expression pattern; it is mainly expressed in the region called the Dentate Gyrus (DG). Use the spatial.plot function to find out which of them it is. As your answer to this question, plot that particular gene using spatial.plot.
Comment : If your mouse brain anatomy is a bit rusty, the image below might help you to identify where exactly the Dentate Gyrus (DG) is.
Dentate Gyrus (DG) marked on HE-image
# Plot the gene highly expressed in the DG here
# using the spatial.plot function
dg.gene <- "Prox1"
dg.c.vec <- se@assays$spRNA[dg.gene,] %>% as.numeric
gg <- spatial.plot(se,
img,
color.vector = dg.c.vec
)
print(gg)
Having crafted some tools and gained more understanding of the data, we will now move on to more exciting stuff!
Ehm.. so there is actually one more not-so-exciting thing left to do… data pre-processing, which includes filtering and normalization. But we are almost there!
Just as with the single cell data, we want to curate our data to make sure that it’s apt for the downstream analysis. The first step in this procedure is to remove bad “spots” - these could for example be spots which aren’t covered by the tissue (but still were - falsely - identified as such by the 10x software).
To get a better sense of which spots we should remove, we will look at histograms of the meta data columns nCount_spRNA and nFeature_spRNA. The former informs us of the total number of observed transcripts in each spot, while the latter tells us how many different genes we observe in each spot.
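To see how these two metrics relate to the count matrix, consider a toy sketch (made-up counts, three genes by two spots):

```r
# toy count matrix: rows are genes, columns are spots (made-up numbers)
m <- matrix(c(5, 0, 2,
              0, 0, 7),
            nrow = 3)
colSums(m)      # total transcripts per spot (nCount-like)   -> 7 7
colSums(m > 0)  # detected genes per spot    (nFeature-like) -> 2 1
```

Note how two spots can have the same total count while differing in how many genes were detected - which is exactly why we look at both histograms.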
# Plot histograms over nCounts_spRNA and nFeatures_spRNA
# generate histogram of nCount_spRNA
h1 <- ggplot(se@meta.data,
aes(x = nCount_spRNA)) +
geom_histogram(color = "black",
fill = "red",
alpha = 0.7
) +
geom_vline(xintercept = 7000,
color = "black",
linetype = "dashed"
)+
labs(title = "Histogram | Total Counts") +
ylab("No. Spots") +
xlab("No. Transcripts")
# generate histogram of nFeature_spRNA
h2 <- ggplot(se@meta.data,
aes(x = nFeature_spRNA)) +
geom_histogram(color = "black",
fill = "blue"
)+
geom_vline(xintercept = 1500,
color = "black",
linetype = "dashed"
)+
labs(title = "Histogram | Features") +
ylab("No. Spots") +
xlab("No. Features")
h <- h1 - h2
print(h)
Dashed black lines have been added at the values 7000 (nCount_spRNA) and 1500 (nFeature_spRNA), these are the limits which we will settle on.
Q3: Create a vector named keep.spots which contains the names of all spots with both of the following criteria satisfied:
nCount_spRNA should be larger than or equal to \(7000\)
nFeature_spRNA should be larger than or equal to \(1500\)
Comment : Do not make any changes to your Seurat object yet.
crit1 <- se[[]]$nCount_spRNA >= 7000
crit2 <- se[[]]$nFeature_spRNA >= 1500
keep.spots <- colnames(se)[crit1 & crit2]
Naturally, the next step is to identify which genes to keep and which to discard. To guide us in this procedure we will plot histograms of (i) the total number of UMIs for each gene (taken over all spots) and (ii) the number of spots in which each gene is observed. We use a log10 transformation with a pseudocount of 1 for (i); otherwise the histogram becomes very distorted.
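As a quick aside on why the pseudocount is needed: log10(0) is -Inf, which would break the histogram for genes with zero counts, whereas adding 1 maps zero counts to 0:

```r
# without the pseudocount, zero counts produce -Inf
log10(c(0, 9, 99))      # -> -Inf 0.954 1.996
# with the pseudocount, zeros land at 0 and the scale stays well-behaved
log10(c(0, 9, 99) + 1)  # -> 0 1 2
```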
# compute total transcripts over all spots
nCount.genes <- Matrix::rowSums(GetAssayData(se))
# log transform total transcript count
nCount.genes.log10 <- log10(nCount.genes +1)
# compute number of spots each gene has been observed at
nObs.genes <- Matrix::rowSums(GetAssayData(se) > 0)
# construct new data frame with the information
# extracted above
gene.data <- data.frame(nCounts = nCount.genes.log10,
nObs = nObs.genes )
# generate histogram of total transcript counts
h3 <- ggplot(gene.data,
aes(x = nCounts)) +
geom_histogram(color = "black",
fill = "lightblue"
)+
labs(title = "i) Total Expression") +
ylab("No. Genes") +
xlab("Total UMI ")
# generate histogram of gene prevalence
h4 <- ggplot(gene.data,
aes(x = nObs)) +
geom_histogram(color = "black",
fill = "pink",
binwidth = 50
)+
labs(title = "ii) Prevalence") +
ylab("No. Genes") +
xlab("No. Spots")
h <- h3 + h4
print(h)
Q4 : We will apply a fairly stringent filtering; create an additional vector named keep.genes with the names of the genes that satisfy both the following criteria:
Comment : do not make any modifications to your Seurat object yet.
crit1 <- nCount.genes >= log10(300)
crit2 <- nObs.genes >= 5
keep.genes <- rownames(se)[crit1 & crit2]
We will actually add one more layer of filtering to our genes; namely that mitochondrial (MT) and ribosomal (RP) genes will also be removed. These usually have a very dominant and spurious expression profile, while being of little interest to our biological questions. Run the code chunk below to ensure MT and RP genes are excluded.
Comment : grepl is a function that uses regular expressions to match a certain pattern. The pattern we are using here, ^(mt\\.|rp), translates to: “match any gene name that starts with either ‘mt.’ or ‘rp’”. Using the ! sign negates the match from grepl, meaning we only keep genes whose names do not start with “mt.” or “rp”.
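Here is a small demonstration of the pattern on a few made-up gene names:

```r
# made-up example gene names
genes <- c("mt.Nd1", "Rps4", "Actb", "Wfs1")
# lowercased names starting with "mt." or "rp" are matched ...
grepl("^(mt\\.|rp)", tolower(genes))           # -> TRUE TRUE FALSE FALSE
# ... and negating the match keeps everything else
genes[!grepl("^(mt\\.|rp)", tolower(genes))]   # -> "Actb" "Wfs1"
```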
# remove rp and mt genes
keep.genes <- keep.genes[!(grepl("^(mt\\.|rp)",tolower(keep.genes)))]
Q5: Subset your Seurat object se using the vectors keep.spots and keep.genes. The subsetted Seurat object should still be named se (i.e., you should overwrite the old one )
# enter code to subset your Seurat object here
se <- subset(se,
features = keep.genes,
cells = keep.spots )
se
## An object of class Seurat
## 17918 features across 2677 samples within 1 assay
## Active assay: spRNA (17918 features, 0 variable features)
Q6: Now please answer these two questions below:
ans.q5.1 <- NA
ans.q5.2 <- NA
cat(sprintf("Removed genes : %s\nRemoved spots %s",ans.q5.1,ans.q5.2))
## Removed genes : NA
## Removed spots NA
The final step before the actual analysis starts is to Normalize our data, the motivation behind this has already been outlined in the Single Cell lab, so we won’t ponder upon it here - but rather just apply it. As in the previous lab, the SCTransform function will be used for this purpose.
Comment: This might take some time, so don’t worry if you have to wait for a couple of minutes!
se <- SCTransform(se,
assay = "spRNA",
verbose = FALSE)
se
## An object of class Seurat
## 35834 features across 2677 samples within 2 assays
## Active assay: SCT (17916 features, 3000 variable features)
## 1 other assay present: spRNA
Sweet! We are now done with the pre-processing and can finally start analyzing the data.
One good way to explore the inherent structure of our data is to cluster it; this allows us to assess how our spots relate to each other and whether there are any natural groupings contained within the data.
Clustering high-dimensional data like ours, where each data point (spot) has close to \(18000\) features, is usually a bad idea. Most clustering algorithms use distances between data points to find clusters, but as the number of dimensions grows, objects more or less become equidistant from each other. Thus the algorithms render poor results. This phenomenon is known as the “curse of dimensionality”.
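A quick way to convince yourself of this effect is to draw random points in spaces of increasing dimension and look at how the spread of pairwise distances shrinks relative to their mean (a toy illustration, unrelated to our data):

```r
set.seed(1337)
for (d in c(2, 10, 1000)) {
  # 100 random points in d dimensions
  X <- matrix(runif(100 * d), nrow = 100)
  dists <- as.vector(dist(X))
  # relative contrast between farthest and nearest pairs; shrinks as d grows
  cat(sprintf("dim = %4d | (max - min) / mean distance = %.2f\n",
              d, (max(dists) - min(dists)) / mean(dists)))
}
```

As d grows, the printed ratio drops, i.e., “near” and “far” neighbours become harder to tell apart - which is what ruins distance-based clustering.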
In an attempt to avoid this curse, we usually apply some form of dimensionality reduction to our data before clustering it. Here the method of choice is PCA (Principal Component Analysis), executed by using RunPCA. For visualization purposes we also use UMAP (Uniform Manifold Approximation and Projection); this is done with the help of RunUMAP.
For the actual clustering we use the two functions FindNeighbors and FindClusters. The former constructs a SNN ( Shared Nearest Neighbor) graph which is used by the latter to actually identify the clusters.
Comment : You will see that we specify a seed in some of the functions below; this puts our machine in a specific state so that the same random numbers are generated in every run, making the analysis reproducible.
# run pca on data
se <- RunPCA(se,verbose = FALSE)
# run umap on the 30 first principal vectors
se <- RunUMAP(se,
dims = 1:30,
verbose = FALSE,
seed.use = 1337
)
# construct a neighborhood
se <- FindNeighbors(se,
dims = 1:30,
verbose = FALSE)
# cluster data
se <- FindClusters(se,
verbose = FALSE,
resolution = 0.5,
random.seed = 1337
)
se
## An object of class Seurat
## 35834 features across 2677 samples within 2 assays
## Active assay: SCT (17916 features, 3000 variable features)
## 1 other assay present: spRNA
## 2 dimensional reductions calculated: pca, umap
To visualize the results in UMAP-space, we may use the DimPlot function.
# get the number of clusters
n_clusters = length(unique(se[[]]$SCT_snn_res.0.5))
# generate a color palette
color.palette <- rainbow(n_clusters)
# plot data in UMAP-space
DimPlot(se,label = TRUE,
cols = color.palette)
So far this is very similar to what we did with the single cell data, but what is really cool here is that we can see how these clusters are arranged in the real physical space. In order to do so, we use our spatial.plot function to color the spots by their cluster identity.
Comment : The cluster identities are stored in the meta data slot as SCT_snn_res.resolution, where resolution is equal to the parameter value used in FindClusters.
Comment : Since we are plotting labels (cluster identity) and not expression values, we set the color.type parameter of spatial.plot to categorical.
# plot clusters spatially
gg <- spatial.plot(se,
img = img,
color.vector = se[[]]$SCT_snn_res.0.5,
color.type = "categorical"
)
print(gg)
Take a moment to appreciate this, from the few lines of code that constitute our simple analysis of the gene expression data, we manage to capture real anatomical structures that would have taken years of research to define just 20 years ago. Pretty damn awesome, right?
But we are not done yet; now that we have these clusters to work with, the next step is to characterize them. One way of doing that is to conduct a DGE (Differential Gene Expression) analysis. We can again recycle some functions from the single cell lab, here FindAllMarkers, which will present us with marker genes for each cluster. We set the parameter logfc.threshold to 1 to ensure we only get genes that have a fairly high log-fold-change; the parameter only.pos is set to TRUE, which means only upregulated (and not downregulated) genes are returned. We also remove all genes where the adjusted p-value is above a significance threshold of 0.01 by setting the parameter return.thresh to 0.01.
# conduct a DGE analysis
de.markers <- FindAllMarkers(se,
assay = "SCT",
logfc.threshold = 1,
verbose = FALSE,
only.pos = TRUE,
return.thresh = 0.01
)
# inspect result
head(de.markers)
## p_val avg_logFC pct.1 pct.2 p_val_adj cluster
## X3110035E14Rik 6.330698e-175 1.423414 0.995 0.630 1.134208e-170 0
## Tbr1 3.224842e-147 1.021565 0.952 0.387 5.777627e-143 0
## Trbc2 1.684573e-128 1.055687 0.696 0.182 3.018081e-124 0
## Nptxr 5.348052e-136 1.242834 1.000 0.925 9.581570e-132 1
## Slc30a3 2.794345e-110 1.144968 0.934 0.461 5.006349e-106 1
## Lypd1 1.683210e-98 1.562161 0.963 0.625 3.015640e-94 1
## gene
## X3110035E14Rik X3110035E14Rik
## Tbr1 Tbr1
## Trbc2 Trbc2
## Nptxr Nptxr
## Slc30a3 Slc30a3
## Lypd1 Lypd1
As a sort of sanity check, let us assess whether the marker genes’ expression overlaps with the actual clusters they are supposed to represent. For a quick assessment, let us plot one marker gene associated with each cluster using the spatial.plot function.
# grab the name of the first marker gene for each cluster
de.markers.one <- de.markers[!duplicated(de.markers$cluster),]
# list to hold our plots
plot.list <- list()
# iterate over the selected marker genes
for (gene in de.markers.one$gene) {
# get the expression vector of the marker gene
expression.vector <- as.numeric(GetAssayData(se)[gene,])
# get cluster that the marker is associated with
cluster.id <- de.markers.one[gene,"cluster"]
# construct plot title
plot.title <- sprintf("Cluster %d | Gene : %s",cluster.id,gene)
# generate plot of marker gene
plot.list[[gene]] <- spatial.plot(se,
plot.title = plot.title,
color.vector = expression.vector
)
}
# arrange plots in grid and plot
grid.arrange(grobs = plot.list,
ncol = 4)
When clustering our spots, we group together spots with similar expression profiles (ones that reside near each other in gene expression space). But remember how we mentioned in the introduction that each spot’s expression profile is a mixture of contributions from multiple cells. Thus, it is sometimes more interesting to know what the constituents of these mixtures are and how they are distributed within the data.
To approach this question, we can consider each spot as a combination of some latent - unobserved - factors. These factors, depending on how we design the model, could represent transcription programs or alternatively cell types. In contrast to a spot being assigned to a single cluster, it is instead defined by the composition of contributions from the respective factors. The process of retrieving and characterizing these factors may be referred to as a form of decomposition or factorization.
Such factors can be learned from the data, but also defined a priori; a good example of the latter is when single cell data is integrated with spatial data (next section) - we can then consider the cell types’ associated expression profiles as latent factors. However, for now, we will focus on the case where we do not know the character of our factors prior to the analysis.
There is a plethora of different methods to conduct factor analysis or modeling of similar problems, some of them designed specifically with gene expression data in mind. Since more complex models tend to take longer to fit and converge, we will use a relatively simple form of factorization called Latent Dirichlet Allocation (LDA), borrowed from the field of topic modelling.
LDA was first presented in 2003 (Blei et al.), the original problem it was designed to solve can be described as:
“Given a corpus of 𝑀 documents, with a vocabulary consisting of 𝑉 words, assume each document can be assigned to one or more of 𝑇 topics. Each topic 𝑡 is characterized by its frequency distribution over the 𝑉 words. How are the words distributed across the topics and how are the topics distributed across the documents?”
If we consider this in the context of spatial transcriptomics, we can see how an alternative formulation could be stated as :
“Given a section with 𝑆 spots, with a total of 𝐺 genes being expressed across all spots, assume each spot can be assigned to one or more of 𝑃 expression programs. Each program 𝑝 is characterized by its frequency distribution over the 𝐺 genes. How are the genes distributed across the expression programs and how are the programs distributed across the spots?”
Despite being described as “simple”, the LDA model takes a fairly long time to fit and scales poorly with the number of genes; therefore we will only use the top 5000 highest expressed genes (based on the raw, not normalized, data).
Q7: Create a vector named top.5000 that contains the names of the top 5000 genes w.r.t. total gene expression.
Comment: You want to use the assay “spRNA” and not “SCT” for this. This is because LDA operates with frequency distributions of words, computed from the number of times each word (here gene) occurs in every document (spot) - thus using normalized data wouldn’t make any sense.
# create the vector top.5000 here
top.5000 <- order(Matrix::rowSums(GetAssayData(se,assay ="spRNA")),
decreasing = TRUE)[1:5000]
top.5000 <- rownames(se)[top.5000]
head(top.5000)
## [1] "Oraov1" "Crisp1" "Rab3il1" "B4galt6" "Hbb.y" "Zfp516"
The next code chunk provides code that converts the data into a suitable format for the LDA implementation and then fits the model. It uses \(10\) topics; this number is chosen quite arbitrarily.
Now, you can have a look at the code below but DO NOT RUN IT. The fitting procedure will take about 1 hour on a good computer, so to avoid wasting valuable time on a slow algo - skip ahead to the next code chunk, where you can load an already fitted model (produced using the code in this chunk).
# DO NOT RUN
# Code to prepare data for the LDA algorithm
# and the actual fitting of the model
# construct a Term times Document Matrix from our count data
tdm <- as.TermDocumentMatrix(GetAssayData(subset(se,features = top.5000),
assay = "spRNA"),
weighting = "weightTf"
)
# get frequency weights
tdm <- weightTf(tdm)
# convert Term times Document to Document times Term Matrix
# desired input to LDA function
dtm <- as.DocumentTermMatrix(tdm)
# run LDA with 10 topics
lda.res <- LDA(dtm,
k = 10)
Run the code chunk below to load a fitted model.
# load an already fitted model
lda.res <- readRDS("data/lda-res.Rds")
lda.res
## A LDA_VEM topic model with 10 topics.
So the model has now been fitted with \(10\) topics (gene expression programs). In contrast to the clustering - where each spot has one label (cluster id) - each spot now has a value for every topic, representing the proportion of transcripts in that spot that originates from the respective topic. These proportion values are found in the gamma slot of the lda.res object.
To see how the topics are spread across our spots, we use the spatial.plot function to visualize our results.
# list to store plots
plot.list <- list()
# iterate over each of the ten topics
for (topic in 1:10) {
# generate plot
plot.list[[topic]] <- spatial.plot(se,
color.vector = lda.res@gamma[,topic],
plot.title = paste("Topic",topic),
legend.title = "proportion"
)
}
# arrange plots in grid and plot
grid.arrange(grobs = plot.list,
ncol = 4)
Q8: Compute the rowsums of the gamma slot in lda.res, inspect your result and answer the following questions:
The rowsums are all identical, what is the value that all sums share?
Why do all rows sum to this value?
Give your answers as comments in the code chunk below; they should be short and concise, no more than a sentence or two.
# Answers to Question 8
# Question 8.1
#
#
# Question 8.2
#
#
Now what’s even neater with LDA is that in addition to the topic distribution among spots, it also fits the word (gene) distribution among topics. This means that we can see which genes are very frequently observed (and thus characteristic) within each topic. More precisely, the beta slot of our lda.res object holds the logged (natural logarithm) probabilities of a word being sampled from a specific topic.
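Since the beta slot stores logged probabilities, exponentiating recovers the probabilities themselves, and within one topic these sum to one over the vocabulary; a toy sketch of that relationship (made-up values over a three-"gene" vocabulary):

```r
# toy log-probability vector for one topic (made-up values)
log.p <- log(c(0.5, 0.3, 0.2))
exp(log.p)       # -> 0.5 0.3 0.2 (back on the probability scale)
sum(exp(log.p))  # -> 1
```

The same exp() trick is used in the next chunk to turn the beta values for Topic 9 into interpretable probabilities.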
Topic 9 - concentrated to the CA 1-3 regions and the dentate gyrus - looks sort of interesting, let us therefore see what \(20\) genes that are most strongly associated with this topic.
# get the indices of the top 20 most frequent genes of topic 9
top.topic.9 <- order(lda.res@beta[9,],decreasing = T)[1:20]
# create a data frame with gene names and probabilities
top.topic.9 <- data.frame(gene = lda.res@terms[top.topic.9],
prob = exp(lda.res@beta[9,top.topic.9])
)
top.topic.9
## gene prob
## 1 Aldoa 0.018500509
## 2 Calm1 0.015223809
## 3 Atp1b1 0.012289859
## 4 Rtn1 0.011289292
## 5 Ppia 0.010404590
## 6 Slc17a7 0.010401670
## 7 Ncdn 0.008677380
## 8 Dnm1 0.008026410
## 9 App 0.007504871
## 10 Cox4i1 0.007453587
## 11 Ckb 0.007415438
## 12 Cox5b 0.007296034
## 13 Itm2c 0.005726321
## 14 Eef1a2 0.005192897
## 15 Grina 0.004887466
## 16 Tuba4a 0.004758032
## 17 Tuba1a 0.004489874
## 18 Eno2 0.004305474
## 19 Nptn 0.004277163
## 20 Atp1a3 0.004205735
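Since the beta slot stores logged values, exponentiating with exp recovers ordinary probabilities, and each topic’s gene distribution should then sum to one. Here is a minimal standalone sketch of this convention, using a made-up 2-topic, 4-gene matrix rather than the real lda.res:

```r
# toy logged gene probabilities: 2 topics (rows) x 4 genes (columns)
# (hypothetical values, chosen so each row is a probability distribution)
log.beta <- log(matrix(c(0.5, 0.3, 0.15, 0.05,
                         0.1, 0.2, 0.30, 0.40),
                       nrow = 2, byrow = TRUE))
# exponentiate to get back to ordinary probabilities
beta <- exp(log.beta)
# each topic's gene distribution sums to one
rowSums(beta)
## [1] 1 1
```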
If you were to run a pathway/enrichment analysis on this set of genes (i.e., look at what sort of biological functions and pathways it can be associated with), you would see terms like “synapses”, “cell junctions”, “dendritic spine”, “proton transmembrane transporter activity”, etc. This makes sense, since the regions where Topic 9 is most dominant are heavily involved in the formation and retrieval of new memories as well as in the processing of neuronal signals.
Q9: Create a data frame called top.2.genes with the top 2 genes of each topic. This data frame should have the columns “gene”, “prob” and “topic”, containing the gene name, the gene’s probability within the topic, and the topic number, respectively.
Hint 1: the names of the genes can be accessed by lda.res@terms. As you saw in the code chunk above.
Hint 2: It might be a good idea to use a for loop here. However, it’s not a must, there are multiple ways of solving this exercise!
Hint 3: If you want to append an element to a vector, you simply do my.vector <- c(my.vector,new.element)
n.topics <- 10
prob.vec <- c()
gene.vec <- c()
topic.vec <- c()
for (topic in 1:n.topics){
# get the indices of the two highest-probability genes of the topic
pos <- order(lda.res@beta[topic,], decreasing = TRUE)[1:2]
gene.vec <- c(gene.vec, lda.res@terms[pos])
prob.vec <- c(prob.vec, exp(lda.res@beta[topic,pos]))
topic.vec <- c(topic.vec, rep(topic, 2))
}
top.2.genes <- data.frame(gene = gene.vec,
prob = prob.vec,
topic = topic.vec)
print(top.2.genes)
The idea of factor decomposition is attractive, but there are still some challenges to this approach. The factors (what we referred to as topics before; factor is the more general term) do not necessarily represent specific cell types, and we have no immediate biological interpretation of them.
What if we wanted to know where a specific type of neuron, or where astrocytes (a common cell type in the brain), are located in our spatial data? Then we would have to hope that some of our factors correspond to these cell types (which there is no guarantee of) and then try to figure out which factors those would be. Not the most robust approach, right?
A solution to this problem is to use a form of guided factor decomposition, where we already know our factors but are interested in how they are distributed across the spots. In the case of localizing cell types, these factors would be expression profiles of each individual cell type that we are interested in. But from where would we obtain such information? The answer comes in the form of a different type of data, namely single cell RNA-seq.
In single cell experiments, each data point corresponds to one individual cell (once doublets have been removed). This means that if we assign an identity (e.g., cell type or cluster) to each cell, we can learn the expression profile of each identity and then deconvolve our spatial data with these profiles.
Multiple strategies exist to “map” single cell information onto the spatial data, but we will use output from a tool called stereoscope (from the paper “Single-cell and spatial transcriptomics enables probabilistic inference of cell type topography” in Communications Biology); the software is found at github. In short, this method models both single cell and spatial data as Negative Binomial (NB) distributed, then for every gene infers the parameters of each cell type’s distribution, to finally find the optimal combination of cell types in each spot based on Maximum a Posteriori (MAP) estimation.
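To make the deconvolution idea concrete, here is a deliberately simplified, self-contained sketch. It is not stereoscope: instead of NB likelihoods and MAP estimation it uses plain least squares on noise-free simulated data, and all profiles and weights are made up. It does, however, show the core task: recovering per-spot cell type weights from known expression profiles.

```r
set.seed(1)
# hypothetical expression profiles: 50 genes (rows) x 3 cell types (columns)
profiles <- matrix(runif(50 * 3), nrow = 50,
                   dimnames = list(NULL, c("typeA", "typeB", "typeC")))
# simulate a spot that is a 60/30/10 mixture of the three types
true.w <- c(0.6, 0.3, 0.1)
spot <- as.vector(profiles %*% true.w)
# recover the mixing weights by ordinary least squares (no intercept);
# with noise-free data this returns the true weights
fit <- lm(spot ~ profiles + 0)
round(coef(fit), 2)   # approximately 0.60, 0.30, 0.10
```

In real data one would add non-negativity (and often sum-to-one) constraints and a count-based noise model, which is exactly where stereoscope’s NB/MAP machinery comes in.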
To map cell types from single cell data onto the tissue, we need a good single cell data set which is at least somewhat similar to our spatial data, i.e. where we expect the same set of cell types to be present in both data sets.
Fortunately for us, there is a website called mousebrain.org where a lot of single cell data from the mouse brain is available. One of the single cell data sets (from the hippocampal region) is visualized below (in gtSNE-space, similar to our UMAP-embeddings). In the letter+number combinations, the letter indicates the cell type identity of the cluster, while the number indicates different subtypes of the broader cell type classes. The letters are to be interpreted accordingly:
Single Cell in gtSNE-space with annotated clusters
Now, stereoscope is a probabilistic model that takes quite a lot of time to fit, and you usually need GPU resources for it to run in reasonable time. Therefore, you will load the output obtained when applying stereoscope to the aforementioned single cell data set and the tissue section we are working with.
# Load stereoscope analysis results
# get data from file
stsc <- read.table("data/stereoscope-results.tsv",
sep = '\t',
row.names = 1,
header = TRUE)
# transpose to have features as rows and spots as columns
stsc <- t(stsc)
# add a new assay called stereoscope to our se object
se[["stereoscope"]] <- CreateAssayObject(counts = stsc)
# set the default assay to stereoscope
DefaultAssay(se) <- "stereoscope"
# inspect data
GetAssayData(se)[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## 545.377x1766.777 568.853x1766.777 580.59x1787.021
## Astrocytes-14 3.715675e-02 1.341006e-10 2.014709e-10
## Astrocytes-40 1.543621e-10 2.563256e-04 4.897425e-02
## Astrocytes-41 1.543521e-10 1.767440e-02 4.421047e-03
## Astrocytes-42 2.756558e-02 2.639982e-02 2.029330e-10
## Astrocytes-44 1.538011e-10 4.156066e-02 2.017382e-10
## 475.291x1725.951 533.81x1746.364
## Astrocytes-14 2.440989e-10 3.685829e-02
## Astrocytes-40 8.774401e-02 5.441967e-02
## Astrocytes-41 2.444665e-10 2.187103e-10
## Astrocytes-42 1.981639e-02 2.206276e-10
## Astrocytes-44 2.436370e-10 2.182815e-10
The values in the output from stereoscope represent proportions; that is, the fraction of cells at each spot that is estimated to belong to a certain cell type. The cell types present in our data are:
cat(paste(rownames(se), collapse = ", "))
## Astrocytes-14, Astrocytes-40, Astrocytes-41, Astrocytes-42, Astrocytes-44, Blood-73, Ependymal-47, Excluded-30, Excluded-38, Excluded-44, Excluded-6, Immune-14, Immune-32, Immune-34, Immune-35, Neurons-10, Neurons-11, Neurons-12, Neurons-14, Neurons-15, Neurons-16, Neurons-17, Neurons-18, Neurons-19, Neurons-20, Neurons-21, Neurons-22, Neurons-23, Neurons-24, Neurons-25, Neurons-26, Neurons-27, Neurons-28, Neurons-30, Neurons-48, Neurons-49, Neurons-51, Neurons-52, Neurons-54, Neurons-58, Neurons-59, Neurons-60, Neurons-61, Neurons-62, Neurons-63, Oligos-0, Oligos-1, Oligos-14, Oligos-5, Oligos-53, Vascular-14, Vascular-46, Vascular-67, Vascular-68, Vascular-69, Vascular-70
Now, how about having a look at the spatial distribution of the cell types? Our beloved spatial.plot function comes in handy again.
# specify cell types to plot
plot.types <- c("Ependymal-47",
"Neurons-27",
"Neurons-59")
# list to hold our plots
plot.list <- list()
# iterate over each specified cell type
for (cell.type in plot.types){
# get vector of cell type proportion values
color.vector <- GetAssayData(se)[cell.type,]
# generate plot
plot.list[[cell.type]] <- spatial.plot(se,
color.vector = color.vector,
plot.title = cell.type,
legend.title = "proportion"
)
}
# arrange plots in grid
grid.arrange(grobs = plot.list,
ncol = 3)
So now we know how these 3 cell types localize in the tissue, but with 56 types in total it can be a bit hard to get a comprehensive overview of all the types, especially if we want to start assessing patterns of interaction and co-localization, which are often of great interest. To exemplify: if we were to plot every pair of cell types in order to see if they had similar spatial distributions, we would have to look through \(56!/(54!2!) = 1540\) pairs… which is in the upper bounds of “quite a lot”.
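That pair count is just “56 choose 2”, which R can confirm directly:

```r
# number of unordered pairs among 56 cell types: 56!/(54! * 2!)
choose(56, 2)
## [1] 1540
```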
In order to condense the rich information that the single cell mapping has given us, we can instead look at the (Pearson) correlation of the proportion values between different cell types. This will tell us which cell types have similar spatial distributions (co-localizing) as well as which types tend to populate different regions.
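To build intuition for what these correlations capture, consider a tiny hypothetical example with made-up proportion vectors over four spots:

```r
# made-up proportions of three cell types across the same four spots
type.a <- c(0.8, 0.7, 0.1, 0.0)
type.b <- c(0.7, 0.9, 0.0, 0.1)   # similar spatial pattern to type.a
type.c <- c(0.0, 0.1, 0.9, 0.8)   # opposite pattern
cor(type.a, type.b)   # close to +1: co-localizing
cor(type.a, type.c)   # close to -1: occupying different regions
```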
We will compute these correlation values - cast as a correlation matrix - and then visualize them in a heatmap.
# generate a color palette
color.pal <- rev(colorRampPalette(brewer.pal(11,"RdBu"))(nrow(se)))
# compute pearson correlation of cell types across spots
cors <- GetAssayData(se) %>%
as.matrix() %>%
t() %>%
cor()
# set diagonal to NA
diag(cors) <- NA
# generate heatmap and plot it
heatmap(cors,
col = color.pal,
scale = "none",
verbose = FALSE,
breaks=seq(-1, 1, length.out=length(color.pal) +1)
)
From this we can see that the cell types “Neurons-10” and “Neurons-15” seem to have a high degree of co-localization, among many other pairs. To assess whether this makes sense, we can plot their spatial distribution using spatial.plot.
Q10: Plot the spatial distribution of the two cell types “Neurons-10” and “Neurons-15”, like we did for the cell types “Ependymal-47”, “Neurons-27” and “Neurons-59”; you are welcome to take inspiration from the code used before.
# Write the code to plot the spatial distribution of
# the cell types Neurons-10 and Neurons-15 here
plot.types <- c("Neurons-15","Neurons-10")
plot.list <- list()
for (cell.type in plot.types){
color.vector <- GetAssayData(se)[cell.type,]
plot.list[[cell.type]] <- spatial.plot(se,
color.vector = color.vector,
plot.title = cell.type,
legend.title = "proportion",
size = 1
)
}
grid.arrange(grobs = plot.list,
ncol = 2)
How: First, go to the top of this document and set GRADE_MODE = TRUE. Next, knit this document and hand in the resulting html-file on Canvas. Make sure you have solved all the questions; if the document doesn’t compile, there is something wrong with the code.
Deadline: Your report is due 23:59 one week after the scheduled lab session. If working in pairs, each of you should hand in an (identical) report where the names of both authors are clearly stated. For further information, see the guidelines on the course web-page.